ABSTRACT

One way to learn from data that is physically distributed over multiple locations is to apply a common learning mechanism at each source and transmit the knowledge of each learnt concept to a centralized location for assimilation. In this research, clustering is employed as the learning mechanism, and a cluster is viewed as a concept described by a set of variables. The set of variables that describes a cluster is referred to as a knowledge packet (KP). Because histograms can characterize any type of data, a histogram-based regression line is used as one of the variables describing a KP. To allow online monitoring of the progress of learning, and for computational ease and efficacy, the KPs at the centralized location are fused incrementally to obtain the overall knowledge. If the learning mechanism is sensitive to the sequence of the data, different orders of merging the generated KPs may yield entirely different overall knowledge. Further, the distance measure used to compare KPs when determining the merging sequence may also change the overall knowledge. This phenomenon is referred to as the problem of order effect. To minimize or avoid the order effect, the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which is insensitive to the order in which data samples are presented, is used to learn from the data chunks, and a novel methodology is presented for finding the distance between batches of data and thereby a better sequence for merging the KPs. A specially designed distance measure for histogram-based objects (histo-objects) is employed to find the distance between KPs, and the nearest KPs are merged incrementally until certain conditions are satisfied. The proposed methods provide a robust mechanism for avoiding order effects.
Since real distributed datasets are difficult to obtain, the effectiveness of the proposed approaches is demonstrated with a carefully designed synthetic dataset. Some benchmark datasets were also modified to simulate a distributed environment, and experiments on some of them show an accuracy of up to 100%.
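
The incremental fusion of KPs summarized above can be illustrated with a minimal sketch. It is not the method of this work: the KP here holds only a fixed-bin histogram and a sample count, the distance is a placeholder L1 gap between normalized histograms rather than the specially designed histo-object measure, and the stopping condition is an assumed distance threshold. The names `make_kp`, `kp_distance`, `merge_kps`, `fuse` and the `threshold` parameter are all hypothetical, and each cluster is assumed to be non-empty.

```python
from itertools import combinations


def make_kp(values, bins=4, lo=0.0, hi=1.0):
    """Summarise one non-empty cluster as a hypothetical KP: histogram plus count."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp v == hi into the last bin
        counts[idx] += 1
    return {"hist": counts, "n": len(values)}


def kp_distance(a, b):
    """Placeholder distance (assumption): L1 gap between normalised histograms."""
    return sum(abs(x / a["n"] - y / b["n"]) for x, y in zip(a["hist"], b["hist"]))


def merge_kps(a, b):
    """Fuse two KPs by adding their bin counts and sample counts."""
    return {"hist": [x + y for x, y in zip(a["hist"], b["hist"])], "n": a["n"] + b["n"]}


def fuse(kps, threshold=0.5):
    """Repeatedly merge the nearest pair of KPs until the nearest pair is too far apart."""
    kps = list(kps)
    while len(kps) > 1:
        # Pick the globally nearest pair, so the result does not follow arrival order.
        i, j = min(
            combinations(range(len(kps)), 2),
            key=lambda p: kp_distance(kps[p[0]], kps[p[1]]),
        )
        if kp_distance(kps[i], kps[j]) > threshold:  # assumed stopping condition
            break
        merged = merge_kps(kps[i], kps[j])
        kps = [kp for k, kp in enumerate(kps) if k not in (i, j)] + [merged]
    return kps


# Three KPs as they might arrive from three sources (illustrative values only).
site_a = make_kp([0.10, 0.15, 0.20])
site_b = make_kp([0.12, 0.18, 0.22])
site_c = make_kp([0.80, 0.85, 0.90, 0.95])
overall = fuse([site_c, site_a, site_b])
```

Because the nearest pair is chosen over all KPs at every step, presenting the same KPs in a different arrival order leads to the same fused result in this example; ties between equally near pairs are broken by position here, which is a simplification.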

Keywords: Cluster analysis, Incremental augmentation of knowledge, Order effect, Regression analysis